ABSTRACT

We present a Principal Component Analysis (PCA)-based spectral classification, η, for the first 5600
galaxies observed in the DEEP2 Redshift Survey. This parameter provides a very pronounced separation 
between absorption and emission dominated galaxy spectra - corresponding to passively evolving and 
actively star-forming galaxies in the survey respectively. In addition it is shown that despite the high 
resolution of the observed spectra, this parameter alone can be used to quite accurately reconstruct any 
given galaxy spectrum, suggesting there are not many 'degrees of freedom' in the observed spectra of this 
galaxy population. It is argued that this form of classification, η, will be particularly valuable in making
future comparisons between high and low-redshift galaxy surveys for which very large spectroscopic 
samples are now readily available, particularly when used in conjunction with high-resolution spectral 
synthesis models which will be made public in the near future. We also discuss the relative advantages of 
this approach to distant galaxy classification compared to other methods such as colors and morphologies. 
Finally, we compare the classification derived here with that adopted for the 2dF Galaxy Redshift Survey 
and in so doing show that the two systems are very similar. This will be particularly useful in subsequent 
analyses when making comparisons between results from each of these surveys to study evolution in the 
galaxy populations and large-scale structure. 

Subject headings: Galaxies: high-redshift — galaxies: evolution 



1. INTRODUCTION 

The classification of galaxies is of fundamental impor- 
tance for understanding galaxy populations, and for this 
reason is a very important aspect of any galaxy redshift 
survey. Having a data set of many thousands of galaxy 
spectra allows one to test the validity of galaxy forma- 
tion and evolution scenarios with unprecedented accuracy. 
However, the sheer size of the full spectral data set presents 
its own unique problems. In order to make such a galaxy 
data set more 'digestible' some form of data compres- 
sion is necessary, whether this be through the adoption of 
morphological segregation, colors or some other compres- 
sion/classification scheme. If these quantities (and their 
associated distributions) can be determined consistently 
over a wide range of redshifts, they can be compared with 
theoretical predictions and simulations, and hence set con- 
straints on scenarios for galaxy evolution. This will be 
especially true if consistent classification regimes can be 
used for both the high redshift (z ~ 1) surveys currently 
underway (the DEEP2 Galaxy Redshift Survey, Davis et 
al. 2002 and the VIRMOS-VLT Deep Survey, Le Fevre et 
al. 1999) and the large z ~ 0 surveys now approaching
full completion: the Sloan Digital Sky Survey (SDSS,
Strauss et al. 2002) and the 2dF Galaxy Redshift Survey
(2dFGRS, Colless et al. 2001).

A number of different approaches to the classification 
of galaxy spectra have been adopted for local galaxy sur- 
veys. These include the calculation of rest-frame colors 
(e.g. Strateva et al. 2001); principal component analysis 
(PCA) based spectral classifications (e.g. Connolly et al. 
1995; Bromley et al. 1998; Folkes et al. 1999; Madgwick 
et al. 2002; de Lapparent et al. 2003), and other more 



sophisticated discriminations (e.g. Heavens, Jimenez & 
Lahav 2000; Slonim et al. 2001), based upon informa- 
tion theory. The underlying theme of all these alternative 
methods is that they characterize the galaxy population 
exclusively in terms of their observed spectra. The work 
presented here is the first attempt to apply one of these 
methods (PCA) to the classification of such a distant sam- 
ple of galaxies. 

1.1. The role of spectral classification 

There are three methods which have generally proved to 
be popular for the classification of galaxies: morphological 
segregation, rest-frame colors and direct spectrum based 
classifications. Each of these methods has its own unique 
drawbacks and advantages. 

To understand galaxy evolution out to redshifts of z ≳ 1,
it is essential to have a consistent implementation of these 
classifications over a wide range of look-back times. For 
this reason it can be argued that morphological segregation,
although perhaps the most natural form of classification,
may not be the optimal solution to adopt over such a
large range of redshifts. This is due to both the degra- 
dation of morphology with redshift and the absence of a 
robust and repeatable methodology to perform this clas- 
sification (see e.g. Conselice 2003 for further discussion 
and possible solutions to this situation). For this reason 
we focus in this paper on alternative classification meth- 
ods to morphology, which we hope will complement earlier 
studies based on this method, whilst at the same time pro- 
viding a new perspective by more directly reflecting the 
physical properties of each galaxy. 

The remaining two options - rest-frame colors and spec- 
tral 'types' - are linked, in that they both provide some 
compressed representation of the observed galaxy's spectral
energy distribution (SED), hence providing a relatively
direct insight into physical processes such as star-
formation currently occurring in each galaxy. However, in 
terms of how each is calculated for high-z galaxies there are 
significant differences which are important to understand. 

The main complication for the calculation of rest-frame 
colors is the accurate determination of k-corrections to
account for the different pass-bands sampled by each
photometric filter at different redshifts. These generally cannot
be estimated from the observed spectrum, but rather must 
be determined by matching each observed galaxy to a set 
of template SEDs with full rest-frame wavelength cover- 
age. 

In the case of spectral classification, one must consider 
that the rest-frame wavelength coverage varies with the 
redshift of the galaxy, hence to adopt a uniform classifi- 
cation over a large range of redshifts one needs to signifi- 
cantly restrict the rest-frame wavelength range considered 
(e.g. by focusing in on particular line features through 
equivalent width measurements) or to determine some way 
of filling in the gaps that are not observed in each spec- 
trum. 

These two problems are very closely related in that they 
express the need to determine the form of a galaxy's SED 
over a wavelength interval that is not necessarily observed. 
In the case of k-corrections only ~ 10 template SEDs are
available for this task (Kinney et al. 1996; Coleman, Wu 
& Weedman 1980); reddening is manually incorporated 
and we have little control over the wavelength intervals 
involved. However as will be shown in this paper, with 
principal component analysis (PCA), we can achieve this 
task for spectral classification, using the observed spectra 
themselves as templates, giving us ~ $N_{\rm gal}$ galaxy
templates, with which to interpolate or extrapolate a given
observed spectrum. In addition, because we have much 
more control over the wavelength interval adopted for the 
classification we can modify the analysis to use only the 
parts of the observed spectra which have the most uniform 
rest-frame wavelength coverage. 

It is primarily for this latter reason that adopting PCA- 
based spectral types will provide a classification scheme 
that is particularly uniform over the large range of red- 
shifts encountered in the DEEP2 redshift survey, hence 
providing a robust probe of evolution in the galaxy pop- 
ulation. In addition we note that such methods of clas- 
sification are timely, in that there is now a considerable 
body of low redshift spectroscopic data from e.g. 2dFGRS 
and SDSS, which enables us to make detailed comparisons. 
We note that PCA-based classifications do suffer from one 
particular drawback, which is that they are not as straight- 
forward to interpret as other classifications, and this is an 
issue we will attempt to address in this paper. 

The outline of this paper is as follows: In Section 2 we 
briefly discuss the DEEP2 Redshift Survey data that we 
will use in this analysis. Section 3 describes the imple- 
mentation of PCA we have adopted for this paper, and 
gives a discussion as to why this particular formalism has 
been used. In particular, issues regarding the restframe 
wavelength coverage are discussed in some detail. In
Section 4 we discuss the method of spectral classification we
will adopt, based upon the results of the PCA we have im- 
plemented. This spectral classification is contrasted with 



that of the 2dF Galaxy Redshift Survey in Section 5, in 
which the selection effects of the two surveys are also dis- 
cussed. We then conclude this paper in Section 6 with a 
brief discussion as to future applications of this work. 

2. DEEP2 GALAXY SPECTRA 

In its first season of observations (August - October, 
2002), the DEEP2 Redshift Survey has already accurately 
measured the redshifts of ~ 5600 galaxies, out of a
proposed total of 60,000. These galaxies have been
pre-selected to have z ≳ 0.7 and an $R_{AB}$ limiting magnitude
of 24.1, from a set of B, R and I CFHT 12k x 8k images
covering ~ 4 deg² on the sky. Foreground galaxies have
been excluded using a simple photometric cut, based upon
the observed R − I and B − R color of each galaxy (Davis
et al. 2002). In this paper we make use of all this data 
from the first observing season. 

A sophisticated automated pipeline has been developed 
to efficiently extract and reduce the spectra observed in the 
survey, details of which will be presented by Davis et al. 
(2003). The observed spectra themselves are taken at high 
resolution (R ~ 5000) using the DEIMOS spectrograph 
mounted on the 10-m Keck-II telescope (Davis & Faber, 
1998), and generally span the wavelength range of 6400 <
λ < 9200 Å. For galaxies with z > 0.7 this allows us to very
accurately measure the redshift of each galaxy, particularly
when the resolved [O II] doublet is present (out to a redshift
of z = 1.5). Absorption based redshifts are also readily 
determined, primarily using the Ca H+K features which 
are visible to redshifts of z ~ 1.3. 

All spectra used in this analysis have been corrected for 
the DEIMOS throughput efficiency of the 1200 line/mm 
grating used in the survey (footnote 7), and are presented in units of
counts/pixel/hour. Note that the flux calibration is only 
approximate at present (~10%), however we have found 
the components derived from our PCA to be robust to 
the exact value of the throughput efficiency. Additionally, 
the spectra have been normalized to have a mean count 
of unity, after being minimally smoothed (3 pixels ~ 0.9 Å
observed-frame, ~ 0.5 Å restframe) according to the
inverse of their estimated variance (in order to remove sky
residuals and other artifacts remaining from the spectral 
reduction). 
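The smoothing and normalization described above can be illustrated with a minimal sketch. The function name, the boxcar kernel and the treatment of the spectrum edges are assumptions made for illustration; they are not taken from the DEEP2 pipeline.

```python
import numpy as np

def smooth_and_normalize(flux, variance, width=3):
    """Inverse-variance-weighted boxcar smoothing, then scaling to unit mean.

    Hypothetical helper: `width` is the smoothing window in pixels
    (3 pixels is the value quoted in the text).
    """
    flux = np.asarray(flux, dtype=float)
    weights = 1.0 / np.asarray(variance, dtype=float)
    kernel = np.ones(width)
    # Weighted mean inside each window: sum(w * f) / sum(w).
    numerator = np.convolve(flux * weights, kernel, mode="same")
    denominator = np.convolve(weights, kernel, mode="same")
    smoothed = numerator / denominator
    # Normalize so that the mean count of the spectrum is unity.
    return smoothed / smoothed.mean()
```

Because each pixel is weighted by the inverse of its variance, a pixel dominated by a sky residual (very large variance) contributes almost nothing to the smoothed value at its position.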

3. PRINCIPAL COMPONENT ANALYSIS 

The spectral classification presented here is based upon 
a Principal Component Analysis (PCA) of the DEEP2 
galaxy spectra. PCA is a powerful technique, allowing us 
to easily visualize and quantify a multi-dimensional popu- 
lation in terms of just a handful of significant components. 
It does this by identifying the components of a data set 
(in this case the galaxy spectra) which are the most dis- 
criminatory between each galaxy (where the significance 
is determined in terms of its contribution to the variance 
over the entire sample). This allows us to identify just the 
most significant components for future use. It is clear from 
such a formalism that any clustering in the PCA space is 
indicative of distinct sub-populations within the sample. 

7 see |http: //ww.ucolick. org/~ ripis c/Gol 200ThtmTl

Fig. 1. The mean rest-frame spectrum for all galaxies (z > 0.6) observed to date in the DEEP2 Redshift Survey. The top panel shows the
mean spectrum in units of normalized flux, while the bottom panel shows how many galaxies have contributed to each wavelength channel
(which is determined by the redshift and wavelength coverage of each galaxy). Also shown (dotted lines) are the wavelength ranges considered
in our PCA analyses: one spans Cut 1 to Cut 2 (3700-4200 Å), the second Cut 1 to Cut 3 (3700-5050 Å).

Fig. 2. The first two projections ($p_1$ and $p_2$) are shown for
both the PCA analyses. The top two panels show the projections
for the PCA defined using only the 3700-4200 Å wavelength range,
while the bottom cover 3700-5050 Å. In each case the left panel
shows the projections for only those galaxies which span the full
rest-frame wavelength range used, whereas the right panel shows
the projections for all DEEP2 galaxies, regardless of their restframe
wavelength coverage. It can be seen that the PCA is consistent in
both samples for the shorter wavelength range adopted, but not for
the larger range. This is also found to be true of higher principal
components $p_3$, $p_4$ etc.

PCA has been used with considerable success by a variety
of authors to deal with large multi-dimensional data
sets. Several similar mathematical formulations of PCA
for galaxy spectra have appeared in the literature, in
particular Connolly et al. (1995); Bromley et al. (1998);
Galaz & de Lapparent (1998); Folkes et al. (1999). The 
most significant difference between the various techniques 
is that some methods utilize mean-subtracted covariance 
matrices for the PCA (by subtracting the mean of the 
normalized spectrum of each galaxy), while others do not. 
This has little effect on the subsequent analysis since the 
latter methods simply yield a mean spectrum as the first 
component. 

In this paper we present a new variation on previous 
formalisms for carrying out the PCA of galaxy spectra, 
designed to accommodate the features of our data set that 
make the standard method very difficult to apply. In par- 
ticular there are two complications that must be addressed 
to analyze the DEEP2 galaxy spectra: the first is the very 
high resolution of the observed spectra which require ex- 
tremely time-consuming matrix computations. The sec- 
ond is that the observed spectra cover very different rest- 
frame wavelength ranges once de-redshifted, and also have 
a number of effective 'gaps' present due to the presence 
of strong night-sky lines. For this reason a method of 
performing interpolations and (small) extrapolations is re- 
quired to ensure the classification is uniform. 

3.1. Implementing PCA for high-resolution data 

A significant drawback to implementing PCA on large 
or very high-dimensional data sets is the required com- 
putation time, particularly to determine the covariance 
matrix. For n galaxies - each with p spectral channels - 
this requires $O(np^2)$ operations. Given that each DEEP2
spectrum contains $O(10^4)$ channels this would be very time
consuming indeed if the full spectral resolution is used.

Fortunately, it is possible to solve for the PCA eigen- 
spectra without calculating the entire covariance matrix. 
The key is developing a probabilistic formalism for the 
PCA, compatible with an expectation-maximization (E- 
M) algorithm (see Roweis 1997 and Tipping & Bishop 
1999). Adopting such a formalism allows one to solve for 
the first k eigenspectra in only $O(npk)$ operations. It has
been shown that this method for performing PCA is ro- 
bust, in that it has only one stationary point that is not a 
saddle point, which guarantees there will be no false con- 
vergences (Tipping & Bishop 1999). 
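To give a sense of scale, the two operation counts quoted above can be compared using numbers that appear elsewhere in this paper: n ≈ 5600 galaxies, p ~ 10^4 channels and k = 20 eigenspectra. This is an order-of-magnitude sketch only. Constant factors are ignored, and the E-M count is assumed here to apply to a single pass, with the number of passes needed for convergence left unspecified.

```latex
\begin{align*}
n p^{2} &\approx 5600 \times (10^{4})^{2} = 5.6 \times 10^{11}, \\
n p k   &\approx 5600 \times 10^{4} \times 20 = 1.12 \times 10^{9}, \\
\frac{n p^{2}}{n p k} &= \frac{p}{k} = \frac{10^{4}}{20} = 500 .
\end{align*}
```

Under these assumptions the saving is a factor of p/k, so it grows with spectral resolution for a fixed number of retained eigenspectra.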

Computationally, the E-M algorithm proceeds in iter- 
ations of two steps. The k eigenvectors to be calculated 
are assumed to be spanned by the columns of a p x k ma- 
trix C. We start by making an initial (random) guess for 
the columns of this matrix and use this to determine the 
k x n matrix, X, of k 'states' for each galaxy. These states 
correspond to the principal component projections of each 
galaxy in the non-orthogonal space defined by the columns 
of C. These states are then used in conjunction with the 
original p x n data matrix, Y, to make a better estimate 
for C. This proceeds until convergence. These steps can 
be summarized as, 



1. Step I: $X = (C^{T} C)^{-1} C^{T} Y$

2. Step II: $C^{\rm new} = Y X^{T} (X X^{T})^{-1}$



Once the algorithm converges the columns of C will span 
the space of the first k eigenvectors. This can therefore 
be used to construct the orthogonal basis that defines 
the principal components and their projections. We note 
that this method for PCA gives identical principal com- 
ponents to the other (simpler) formalisms discussed previ- 
ously. The sole reason we adopt an E-M based approach is 
to make the PCA efficient for the large and very high res- 
olution data set we are using. Throughout the remainder 
of this paper we will denote the eigenvectors determined 
by the PCA (herein eigenspectra) as $P_1$, $P_2$ etc. and the
projections onto these axes by $p_1$, $p_2$ etc.
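The two-step iteration and the final orthogonalization can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the code used for the survey: the function name, the fixed iteration count (in place of a convergence test) and the QR-plus-eigendecomposition used to order the components are choices made here for brevity.

```python
import numpy as np

def em_pca(Y, k, n_iter=200, seed=0):
    """E-M estimate of the first k principal components.

    Y is the p x n data matrix (one mean-subtracted spectrum per column).
    Returns P (p x k, orthonormal columns) and the k x n projections.
    """
    rng = np.random.default_rng(seed)
    p, n = Y.shape
    C = rng.standard_normal((p, k))  # initial random guess
    for _ in range(n_iter):
        # Step I: X = (C^T C)^{-1} C^T Y
        X = np.linalg.solve(C.T @ C, C.T @ Y)
        # Step II: C = Y X^T (X X^T)^{-1}
        C = np.linalg.solve(X @ X.T, X @ Y.T).T
    # The columns of C span the principal subspace but are not orthogonal.
    Q, _ = np.linalg.qr(C)
    # Rotate within the subspace so that components are ordered by variance.
    reduced = Q.T @ Y
    evals, evecs = np.linalg.eigh(reduced @ reduced.T / n)
    order = np.argsort(evals)[::-1]
    P = Q @ evecs[:, order]
    return P, P.T @ Y
```

Only k x k systems are solved and the p x p covariance matrix is never formed, which is where the saving over the direct approach comes from. The sign of each returned eigenspectrum is arbitrary.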

3.2. Dealing with incomplete data in the PCA 

The treatment of incomplete data in PCA has already 
been discussed in the literature, e.g. Everson & Sirovich 
(1995) and Connolly & Szalay (1999). However, the imple- 
mentation is complicated and deserves further discussion. 
The issues that must be resolved are two-fold: First, how
to determine the eigenspectra (or principal components) 
when very few galaxies cover the full wavelength range, 
and second, how to project a galaxy onto these eigenspec- 
tra when its spectrum does not cover the entire restframe 
wavelength range considered. It can be shown that the 
latter issue has a relatively straightforward, clean solution 
involving de-correlating the eigenspectra (which are not 
orthogonal when we do not use their entire wavelength 
coverage). This has been presented in Connolly & Szalay 
(1999). However, estimating the eigenspectra in the first 
place is much more difficult to address. 



[Fig. 3 panels. Top panel: horizontal axis is the number of components used. Bottom panel: horizontal axis is wavelength (Å).]



Fig. 3. The significance of each principal component is shown
in the top panel, where the improvement in the χ² difference
between the galaxy spectrum, $f'(\lambda)$, and its reconstruction from the
first i components, $\sum_i p_i P_i$, has been plotted for increasing i. The
lower panel shows how well a set of randomly chosen galaxy
spectra (dotted lines) can be reconstructed using only the mean DEEP2
spectrum and the first principal component, $p_1$ (solid lines). This
highlights the fact that although $P_1$ appears to only encapsulate
information about the nebular emission features (Fig. 4), it does in
fact also contain enough information to reconstruct galaxy spectra
without these features.






Consider the situation when the eigenspectra, $P_i$, have
already been estimated using the PCA. Each of the observed,
mean-subtracted spectra, $f' = f - \bar{f}$, can then be
expressed as a linear combination of these eigenspectra,

$f' = \sum_i p_i P_i$ , (1)

where $\bar{f}$ is the mean spectrum of all the galaxies and
$p_i = f' \cdot P_i$ is the projection of each galaxy onto the ith principal
component. The index i ranges from 1 to the number of
resolution elements in each spectrum. However, because
most of the variance in each spectrum is contained in only
the first few eigenspectra (as will be demonstrated later),
the sum in Eqn. 1 can be truncated to include only the
first k most significant elements.
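The truncated form of Eqn. 1 can be written out as a short sketch. The function names and the use of a per-pixel variance array in the χ² are assumptions for illustration; the eigenspectra are taken to be the orthonormal columns of a matrix `P`.

```python
import numpy as np

def reconstruct(f, mean_spec, P, k):
    """Rebuild a spectrum from the mean and the first k eigenspectra.

    P holds the eigenspectra as orthonormal columns (p x k_max).
    Returns the reconstruction and the k projections used.
    """
    f_prime = f - mean_spec          # mean-subtracted spectrum
    p = P[:, :k].T @ f_prime         # p_i = f' . P_i
    return mean_spec + P[:, :k] @ p, p

def chi2(f, f_rec, variance):
    """Chi-squared difference between a spectrum and its reconstruction."""
    return np.sum((f - f_rec) ** 2 / variance)
```

Calling `reconstruct` with increasing `k` and recording `chi2` each time gives the kind of curve described for the top panel of Fig. 3.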

We wish to determine how to derive the projections $p_i$
when the spectrum, $f'$, does not cover the full restframe
wavelength range spanned by the eigenspectra (e.g. because
of poor sky subtraction or the effects of redshift).
The eigenspectra are defined to be orthonormal,

$P_i \cdot P_j = \delta_{ij}$ . (2)

However, when we only consider the wavelength range over
which the observed spectrum, $f'$, is defined this is no longer
the case. We denote the incomplete spectrum by $w f'$,
where $w$ is a vector with non-zero elements only where $f'$
is defined. As shown in Connolly & Szalay (1999), the
correlations between the portions of the eigenspectra relevant
for $f'$ can be quantified in terms of a correlation matrix,

$M_{ij} = w P_i \cdot P_j$ . (3)

Using this matrix it is possible to show that, when a spec- 
trum does not cover the entire wavelength range of the 
eigenspectra, unbiased estimates of the true projections 
can be calculated simply using the inverse of this correla- 
tion matrix, M, 

$p_i = \sum_j M^{-1}_{ij} \, p'_j$ , (4)

where $p'_j = w f' \cdot P_j$ is the projection evaluated over only
the observed wavelength range, and where again this sum need
only be carried out over the most significant principal
components (in our case we will use the first k = 20). In what
follows we use this procedure for two separate purposes: to
estimate the corrected projections, $p_i$, when the wavelength
range of the galaxy does not fully span the eigenspectra, and
(using these corrected projections) to interpolate over any
gaps in a spectrum resulting from sky-subtraction etc., using
Eqn. 1. This is discussed further in the next section.
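Eqns. 3 and 4, together with the gap filling via Eqn. 1, can be sketched as below. This is an illustrative reading of the procedure rather than the survey code. The function names are hypothetical, `w` is assumed to be a per-pixel weight that is zero wherever the spectrum is unobserved, and the eigenspectra are assumed to be the columns of `P`.

```python
import numpy as np

def gappy_projections(f_prime, w, P):
    """Corrected projections of an incomplete, mean-subtracted spectrum.

    f_prime may hold any value (including NaN) where w == 0.
    """
    # Incomplete spectrum w f', forced to zero in the gaps.
    wf = np.where(w > 0, w * f_prime, 0.0)
    # Eqn. 3: M_ij = sum over pixels of w * P_i * P_j.
    M = P.T @ (w[:, None] * P)
    # Raw projections over the observed pixels only.
    p_raw = P.T @ wf
    # Eqn. 4: p = M^{-1} p', solved without forming the inverse.
    return np.linalg.solve(M, p_raw)

def fill_gaps(f_prime, w, P):
    """Replace unobserved pixels with the reconstruction of Eqn. 1."""
    p = gappy_projections(f_prime, w, P)
    model = P @ p
    return np.where(w > 0, f_prime, model), p
```

The matrix M is only k x k, so the correction is cheap for each galaxy. It becomes ill-conditioned when the observed pixels constrain some eigenspectra poorly, which is one way to view the failure of the wider wavelength range discussed later.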

Before this formalism can be adopted one first requires 
an estimate of the eigenvectors of the PCA, $P_i$. Several
approaches to estimating these eigenspectra in cases where 
few or no galaxies cover the full wavelength range of inter- 
est have been suggested. Everson & Sirovich (1995) discuss 
approaches involving extrapolation of the observed spec- 
tra followed by iteration. Another approach is to make 
a least-squares generalization to the E-M PCA (Roweis 
1997). Each of these methods assumes that the 'gaps' 
in each spectrum are randomly positioned, which may be 
true of low-z galaxy spectra but is clearly unsatisfactory 
for high-z surveys since almost all of the observed spectra 



will miss a substantial amount of either the extreme blue 
or extreme red end of the full rest-frame wavelength range. 
For this reason a simple approach to estimating the eigen- 
spectra will normally fail, since it is difficult to correlate 
(and hence extrapolate) one end of the observed spectrum 
with what would be expected at the other end. 

A more practical solution, which we follow in this pa- 
per, is to restrict the rest-frame wavelength range of in- 
terest such that there is a more significant subsample of 
galaxies which span the entire wavelength range of
interest. Two such divisions are outlined in Fig. 1: the first
involves considering only the rest-frame wavelength range of
3700-4200 Å (covering [O II] and beyond the 4000 Å break)
while the second is extended to 3700-5050 Å (essentially
covering [O II] to [O III]). There is an obvious drawback
to this approach in that we are not making use of all the 
information contained in our observed spectra; however, 
as will be shown below this is a necessary compromise. 

3.3. Limited λ PCA

Using the first wavelength cut (3700-4200 Å), ~ 50% of
our galaxy sample span the entire rest-frame wavelength
range (with the exception of small gaps masked out due
to bad sky subtraction, gaps between the CCDs and bad
pixels) and can be used in a PCA analysis. For the
second wavelength cut (3700-5050 Å), we are using a greater
portion of the observed galaxy spectra and so the analysis 
is potentially more informative; however, only ~ 10% of 
galaxies span this entire wavelength range. 

In each case the procedure we adopt is as follows: 

1. The subset of our galaxies which span the entire 
restframe wavelength range of interest are passed 
through the E-M PCA to determine a first set of 
k = 20 eigenspectra, corresponding to the orthogo- 
nalized columns of the matrix C (c.f. Section 3.1). 

2. Gaps in the galaxy spectra used in the previous
step are interpolated over by re-calculating the
'de-correlated' projections, $p_i$ (as described in the
previous section), which are then used to reconstruct the
missing parts of the spectrum. The new galaxy spectra
are then re-processed through the PCA to produce
improved estimates of the PCA eigenspectra.

3. The new set of eigenspectra is then used to compute
projections for the entire galaxy sample regardless
of rest-frame wavelength coverage, by means of the
de-correlation procedure outlined in Eqn. 4.

We do not iterate further after projections have been
determined for the entire galaxy sample, since this tends to
propagate errors through the analysis and can lead to
unphysical eigenspectra.

Fig. 4. The first three eigenvectors (eigenspectra) derived from the 3700-4200 Å PCA are shown (left plot), together with the mean of the
DEEP2 spectra. The first, $P_1$, appears to be dominated by the strength of any emission lines present. The second displays artifacts from the
varying width of these emission features, while the third appears to measure the amplitude of the 4000 Å break. Note that these eigenspectra
have been derived from the mean subtracted galaxy spectra, and hence represent the difference between each galaxy spectrum and the mean
spectrum (right panel). The lower plot shows a close-up view of the continuum of the mean spectrum and that of the first 3 eigenspectra, in
which it can be seen that $P_1$ also contains an anti-correlation with the absorption features in each spectrum. It can also be seen how each
component adds successively less information about the features present in each spectrum.

Fig. 5. The distribution of spectral types, η = $p_1$, is shown for
all the galaxies observed to date in the DEEP2 Redshift Survey. It
can be seen that there is a slight bimodality about η = −13 and for
this reason we adopt a division of our galaxies at this point.

Fig. 6. The average spectrum of the two different 'types' of galaxies are shown, again in terms of normalized flux. It can be seen that
the PCA has been very successful at separating different types of galaxy spectra.

The values of the first two projections derived in this
manner are plotted in Fig. 2 for both the rest-frame
wavelength ranges considered. This figure allows us to
determine whether the PCA is robust beyond the narrow
redshift range that corresponds to each of the rest-frame
wavelength limits adopted. To judge this we have compared in
each case the distribution of the first two components ($p_1$
and $p_2$) for only those galaxies used in the initial PCA
(the restricted z sample), and those for the entire galaxy
sample (the unrestricted z sample). Although some small
degree of systematic offset is expected, the two should be
broadly similar, which is clearly not the case for the second
sample used (3700-5050 Å). This lack of agreement
effectively highlights the limitations of performing this PCA
formalism for spectra that do not have a sufficient degree
of overlap.

Clearly the extended sample is more interesting, in so
far as it covers a larger rest-frame wavelength range and
therefore includes more spectral features. However, Fig. 2
demonstrates that this classification cannot in fact be
performed robustly since too few galaxies cover this full
rest-frame wavelength range. On the other hand, the more
restricted wavelength choice (3700-4200 Å) is much more
robust. We therefore choose to adopt this latter
wavelength coverage to define our spectral classification.

4. SPECTRAL CLASSIFICATION 

PCA is not a classification algorithm, rather, it is sim- 
ply a method of data compression. For this reason an- 
other (usually manual) step must be performed, in which 
the insight gained from this compression is used to divide 
a galaxy sample. This is not necessarily straightforward 
and will usually involve some degree of arbitrariness - es- 
pecially since galaxy spectra appear to span a single con- 
tinuum of types rather than falling into distinct classes. 

By far the most significant output of the PCA is the first
principal component ($p_1$ contains 10 times more variance
than any other component in our analysis, see Fig. 3),
which appears to be dominated by the nebular emission
line strengths in a spectrum. In particular the strength of
the [O II] emission feature figures prominently, as shown in
Fig. 4. Note however that, as demonstrated in Fig. 3, this
component does in fact also encapsulate a great deal more
information about each spectrum and can be used to very
accurately reconstruct a wide variety of galaxy spectra.
In particular, because the Ca H&K absorption features
are inverted in this eigenspectrum, it can act as a classifier for
both absorption and emission dominated galaxy spectra.

Other eigenspectra, examples of which are shown in
Fig. 4, also contain useful information about each
spectrum. For example, the second eigenspectrum, $P_2$,
appears to primarily quantify the width of the various
emission lines that is not incorporated into the first principal
component. The component $P_3$ reflects the stellar
continuum, and in particular the height of the 4000 Å break
relative to [O II] that is not already encapsulated in the
first principal component. Note that each of these
eigenspectra represents variations in the galaxy spectra relative
to the mean spectrum and so, for example, the [O II] line in
$P_1$ represents the additional emission line strength present
in each spectrum, whether this be less than ($p_1$ < 0) or
greater than ($p_1$ > 0) the value for the mean spectrum.
Beyond the fourth principal component the eigenspectra
become dominated by unphysical features in the galaxy
spectra.

Because the first component is so much more discriminatory
between galaxies (in terms of variance) this appears to
be the most logical (and simplest) classification to adopt,
particularly since this component alone appears to do well
at reconstructing a large variety of observed spectra. It is
additionally reassuring to note that $P_1$ also encapsulates
the greatest variety of features in each galaxy's spectrum
(Fig. 4) and so is also potentially the most astrophysically
interesting component available from the PCA. Note that
excluding the second component means that we are not
making use of the detailed line-width information, some of
which is contained in $p_2$. As the DEEP2 reduction pipeline
improves, more precise line-width information, and even
detailed rotation curves, will be extracted from the full
2-dimensional spectrum of each galaxy (rather than the
compressed 1-dimensional spectra used here). In so doing
we will be able to provide a much more accurate
characterization of the kinematic properties of each galaxy that
will be complementary to the relative emission/absorption
line-strengths measured by $p_1$.

A similar dominance of the first principal component, 
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Fig. 7. — The distribution of restframe [B — R) color is compared 
to the spectral classification, '//deep- It can be seen that there is 
a correlation between the two, particularly for those galaxies with 
large (B — R) colors. 
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Pi , has been noted in previous analyses of low-z galaxy sur- 
veys, suggesting that the distribution of 'normal' galaxy 
spectra can be well approximated by a one-dimensional
sequence. However, we must be careful to ensure that
this projection is in fact a robust measure of a galaxy's
spectrum if we are to use it alone to determine spectral
type. For example, it has been noted for low-z galaxy
samples observed through small fiber-optic apertures that
the first principal component was not stable between
multiple observations of the same object (Bromley et al. 1998;
Madgwick et al. 2002). We find that this is not the case
for our most significant projections, which proved to have
relatively small measurement uncertainties. In particular,
we measured the value of $p_1$ obtained from the instances
where a given galaxy spectrum was multiply observed and
found that the distribution of errors was Gaussian with a
standard deviation of only ~5. This is most likely due to
the use of slit apertures rather than fiber optics. We
therefore choose to adopt $p_1$ as our measure of spectral type
for galaxies observed in the DEEP2 Redshift Survey and
denote it by,

$$\eta_{\rm DEEP} = p_1 . \qquad (5)$$

This notation has been specifically chosen to be analogous to
low redshift samples (e.g. the 2dFGRS) for which a similar 
classification scheme has been adopted. 8 

A histogram of $\eta$ is shown in Fig. [SJ in which a
significant bimodality appears to be present. Since this is the
only discontinuity in the distribution of this projection,
it is natural to divide the sample there ($\eta \sim -13$). The
presence of this bimodality is interesting in that similar
effects have been observed with galaxy colors (Strateva et
al. 2001; Bell et al. 2003; Weiner et al. 2003) and also
in spectral classifications of other samples such as in the
2dFGRS (Madgwick et al. 2002).

Having divided the galaxies into two 'types', we proceed 
to calculate the mean spectrum for each type, shown in 
Fig. Clearly, the difference between these two classes is 
quite pronounced, and is differentiating primarily between 
absorption spectra (associated with older stellar popula- 
tions) and emission dominated spectra (associated with 
recent star-formation activity). Taking this interpreta- 
tion further, it should be possible to directly relate this 
classification to the underlying star-formation activity in 
each galaxy as was done for the 2dFGRS (see Madgwick 
et al. 2003). We return to this point in Sec. |3J. Galaxies
with values of $\eta$ less than $-13$ will be referred to as
either 'early type' or passively evolving galaxies, whilst
those with higher values will be referred to as 'late type'
or actively star-forming galaxies.
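
The two-way division described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the spectra are assumed to be already resampled onto a common rest-frame wavelength grid with no missing pixels, which glosses over the handling of gappy spectra that the actual analysis requires.

```python
from statistics import fmean

# Dividing value of eta_DEEP quoted in the text.
ETA_SPLIT = -13.0


def spectral_class(eta, split=ETA_SPLIT):
    """Return 'early' for eta below the split and 'late' otherwise."""
    return "early" if eta < split else "late"


def mean_spectra_by_class(etas, spectra, split=ETA_SPLIT):
    """Average the spectra of each class pixel by pixel.

    etas    : one eta value per galaxy
    spectra : one flux list per galaxy, all of equal length
              (assumed: common wavelength grid, no gaps)
    """
    groups = {"early": [], "late": []}
    for eta, flux in zip(etas, spectra):
        groups[spectral_class(eta, split)].append(flux)
    # zip(*members) iterates over pixels; classes with no members are skipped.
    return {
        name: [fmean(pixel) for pixel in zip(*members)]
        for name, members in groups.items()
        if members
    }
```

Calling `mean_spectra_by_class(etas, spectra)` returns a dictionary with one averaged spectrum per populated class.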

Figure 7 compares the restframe $(B-R)_0$ colors
of each galaxy with its spectral type, as defined by our $\eta$
parameter. It can be seen from this figure that there is a
correlation between the restframe color and the strength of
emission/absorption features in each observed spectrum,
particularly for those galaxies with the largest $(B-R)_0$.

8 Note that in the 2dFGRS, $\eta$ was defined as a linear combination
of the first two components, $p_1$ and $p_2$. The two components were
needed because of the additional calibration uncertainties introduced
by the small fiber apertures used in that survey. Despite this, the
classifications are in fact comparable, as that combination was also
the most statistically significant output of the PCA which was robust
to the known instrumental uncertainties.



5. COMPARISON WITH Z = 0 CLASSIFICATION

In this Section we briefly compare the classification de- 
rived for the DEEP2 Survey with that adopted in the 2dF 
Galaxy Redshift Survey (Colless et al. 2001), which com- 
prises over 200,000 low redshift galaxy spectra. The reason 
we make this particular comparison is that the two classi- 
fication regimes have been similarly defined, in that they 
are both derived from the most significant component of 
the PCA that is robust to the known instrumental uncer- 
tainties (Madgwick et al. 2002). Such a classification for 
the Sloan Digital Sky Survey is also in progress but is not 
yet available. 

Both classification methods provide a continuous
parameterization of the spectral type of a galaxy, denoted
by $\eta$. At a practical level, this spectral classification is
simply a dot-product between each observed galaxy
spectrum and the chosen classifying eigenspectra, which are
shown in Fig. 8 for the two surveys. It can be seen from
this figure that despite the very different selection criteria
in the two surveys, the chosen classifications both sample
the same spectral features in the same relative
proportions over their common wavelength interval. Hence the
two classifications should be very comparable for broad
studies of galaxy evolution.
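
As a rough sketch of the dot-product step just described, the projection can be written as below. The function names are hypothetical. The optional subtraction of a mean spectrum is an assumption based on common PCA practice rather than something stated in this section, and the sketch assumes complete spectra sampled on the same wavelength grid as the eigenspectrum.

```python
def normalize_to_unit_mean(flux):
    """Scale a spectrum so that its mean flux is unity."""
    mean_flux = sum(flux) / len(flux)
    return [f / mean_flux for f in flux]


def project(flux, eigenspectrum, mean_spectrum=None):
    """Project a spectrum onto a classifying eigenspectrum.

    flux          : observed fluxes on the eigenspectrum's wavelength grid
    eigenspectrum : classifying eigenspectrum from the PCA
    mean_spectrum : optional mean spectrum to subtract first
                    (assumption: standard PCA practice, not confirmed here)
    """
    normalized = normalize_to_unit_mean(flux)
    if mean_spectrum is not None:
        normalized = [f - m for f, m in zip(normalized, mean_spectrum)]
    # The spectral type is the dot product of the two vectors.
    return sum(f * e for f, e in zip(normalized, eigenspectrum))
```

Because each spectrum is first scaled to unit mean flux, the projection is independent of the overall brightness of the galaxy and reflects only the shape of its spectrum.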

The distributions of the $\eta$ spectral types are also shown
in Fig. 8. Note however that because the DEEP2 survey
spectra have much higher resolution, the locus of
projections for this survey spans a much larger range (since
each spectrum has been normalized to have mean flux of
unity regardless of resolution). For this reason we have
scaled the $\eta_{\rm 2dF}$ spectral parameter of the 2dFGRS
survey to match $\eta_{\rm DEEP}$ in this comparison. We find that
multiplying by a factor of 8 suffices to match the two
classifications; a similar factor would be derived by
noting that early type galaxies are separated from late types
at $\eta_{\rm 2dF} = -1.4$ (Madgwick et al. 2002), as opposed to
$\eta_{\rm DEEP} = -13$.
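
The arithmetic behind this rescaling is simple enough to check directly. The snippet below is a minimal sketch using only the numbers quoted above; the constant and function names are hypothetical.

```python
ETA_SPLIT_DEEP = -13.0  # early/late division for DEEP2 (from the text)
ETA_SPLIT_2DF = -1.4    # early/late division for the 2dFGRS (from the text)
ADOPTED_SCALE = 8.0     # multiplicative factor adopted in the text


def scale_from_splits():
    """Scale factor implied by matching the two early/late divisions."""
    return ETA_SPLIT_DEEP / ETA_SPLIT_2DF


def rescale_2df(eta_2df, scale=ADOPTED_SCALE):
    """Place a 2dFGRS eta value on the DEEP2 scale."""
    return scale * eta_2df
```

The ratio of the two dividing values is roughly 9.3, which is of the same order as the adopted factor of 8, consistent with the statement that a similar factor follows from the positions of the early/late division.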

Figure 9 shows a comparison of the average spectra for
early and late type galaxies as calculated from the 2dFGRS
and DEEP2 spectra (with the latter smoothed to the
9 Å resolution of the former). This figure again highlights
that the correspondence between the two classifications is
quite striking: despite being derived over different
wavelength ranges and at different resolutions, they essentially
encapsulate the same physical information.

It has already been shown (Madgwick et al. 2003) that
the spectral classification $\eta$ adopted for the 2dFGRS
corresponds most naturally to the relative amount of star
formation activity currently occurring in each galaxy as
compared with its past average (the Scalo birthrate
parameter, b; Scalo 1986). Objects with high $\eta$ values are
galaxies with particularly strong recent star formation, whereas
the lowest-$\eta$ sample of galaxies has b < 0.1 (i.e. their
current star formation rate is less than 10% of their past-averaged
value). Because there is such a simple one-to-one
correspondence between this spectral classification and the star
formation activity of each galaxy, we can confirm from
Fig. 8 our intuition that selecting galaxies using restframe
U magnitudes (as is the case for the DEEP2 Survey, see
e.g. Weiner et al. 2003) is very much more biased towards
galaxies with recent episodes of star formation than the B
selection adopted for the 2dFGRS. This result will have
important repercussions for the interpretation of future
analyses of the DEEP2 Survey data.

Fig. 8. — Comparison between the eigenspectra used to classify the DEEP2 and 2dFGRS spectra (left panel). It can be seen from this
comparison that the two eigenspectra are using identical features to classify the galaxies in each survey. The distributions of the two spectral
types are shown in the right panel (after scaling the 2dFGRS $\eta$ to take account of the different resolutions). The surveys comprise different
types of galaxies based upon their individual selection criteria; for example, it can be seen that the DEEP2 Survey appears to contain relatively
more 'late-type' galaxies.

A more detailed assessment of the exact relative
frequency of star-forming galaxies in each survey will be
forthcoming in a future paper, in which the selection function of
each survey will be fully incorporated. Once this is
available it will also be particularly interesting to test
how consistent the evolution between the populations
is with that expected from spectral synthesis models (e.g.
Bruzual & Charlot 1993; Fioc & Rocca-Volmerange 1997).

6. CONCLUSIONS 

In this paper we have presented a new PCA-based
spectral classification, $\eta_{\rm DEEP}$, for the galaxies observed to date
in the DEEP2 Redshift Survey. The main goal in
developing this classification was to provide a consistent and
robust measure of the type of a galaxy over the large range
of redshifts encountered in this survey. To do this,
special handling of incomplete and 'gappy' observed spectra
is required.

Fig. 9. — The average spectra of the early and late type galaxies
from the 2dFGRS and DEEP2 surveys are compared. The DEEP2
galaxy spectra (solid line) have been smoothed using an 8 Å filter in
order to match the resolution of the 2dFGRS spectra.

Although this classification, $\eta_{\rm DEEP}$, appears to
primarily identify galaxies with differing strengths of nebular
emission, Fig. [3] demonstrates that this component alone
can reconstruct the spectra of a wide variety of galaxies
extremely well.

We have not directly related this classification to the role
of star formation activity in a given galaxy. However, there
is strong evidence from a similar analysis carried out for
the 2dFGRS that $\eta_{\rm DEEP}$ should also correlate well with
the relative amount of star formation. A more detailed
study of this correlation, together with the role of higher
order PCA projections, will be forthcoming when higher
resolution spectral synthesis models become publicly
available in the near future (the DEEP2 galaxy spectra are at
a resolution that is ~20 times higher than any synthesis
model currently available).

This classification will be particularly useful
in subsequent analyses, e.g. of the galaxy luminosity
function or correlation functions, as it is easily comparable
with other classifications at z = 0, for which large
spectroscopic samples are now publicly available. In addition,
previous work (e.g. Madgwick et al. 2003) has shown
that it is straightforward to make direct comparisons
between such a spectral classification regime and the output
of semi-analytic galaxy models (e.g. Kauffmann, White &
Guiderdoni 1993; Cole et al. 1994; Somerville & Primack
1999), allowing us to directly constrain the assumed
models of galaxy formation and evolution between z = 1 and
z = 0 using this form of classification.

When used in conjunction with spectral synthesis mod- 
els, we expect that parameters similar to the one pre- 
sented here will provide a wealth of information on both 
the amount and 'type' of evolution that has occurred in the 
galaxy population. This is particularly relevant now that 
such large samples of galaxy spectra are becoming avail- 
able over such a wide range of redshifts. For this reason 
we expect studies of the evolution in the galaxy population
to become a particularly rich field of research in the near
future, and that the DEEP2 Redshift Survey will play an
especially prominent role in this field.